Image Classification of Cats and Dogs

Author: Anjukutty Joseph

Student ID: R00171244

Objective

The objective of this assignment is to build a machine learning Cat/Dog image classifier.Given dataset consists of 1000 images for traing the model and another 100 images where the model is applied to predict whether the image is of a cat or dog.

In [98]:
# Load all required libraries
import cv2
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import matplotlib. image as matImage
from os import listdir
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import normalize
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')
from sklearn.decomposition import PCA

Part A ) To build a classification model , first the training images are loaded and preprocessed. A numpy array is created to save all 1000 images in training set and its corresponding labels are also extracted from file names and saved. Label 0 - cat , 1- dog

In [99]:
# Load Training data
sourceDir = "./data/train/"
# list to store the training images
X_train = list() 
# list to store the training labels
Y_train = list() 
# list to store the validation images
X_dev = list()
# list to store the validation labels
Y_dev = list()
#This loop iterates each image present in the source directory.
# Read image using openCV library and convert all images to common size 250x250
for fileName in listdir(sourceDir):    
    image_l = cv2.imread(sourceDir+fileName)
    image_l = cv2.resize(image_l, (250, 250))
    X_train.append(image_l)
    # image name format is cat.1, So extracting the labels from image name
    name = fileName.split('.')    
    if (name[0] == "cat"):
        Y_train.append(0)
    else:
        Y_train.append(1)
# Verify training images
X_train = np.array(X_train)

Reshaping image array and training and validation data spplitting are done in below section. 80 % of training data i.e 800 images are used for training different models and remaining 20 % are used for validating model.

In [100]:
# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
Y_train = np.array(Y_train)
## Normalise Data
X_train = normalize(X_train,norm = 'max')
# 80% for training and 20% for validation
X_train, X_dev, Y_train, Y_dev = train_test_split(X_train,Y_train, test_size=0.20, random_state=42 )

Some of the sample images in training set are shown below. We can see that , training images have cats and dogs of different breed in different colour with photos taken at multiple angles.

In [65]:
_,ax = plt.subplots(5,6, figsize=(30,30))
for i in range(5):
    for j in range(6):
        ax[i,j].imshow(cv2.cvtColor(X_train[100+(i*100)+j], cv2.COLOR_BGR2RGB))
        ax[i,j].axis('off')        
plt.show()

PART B) This section build multiple classification models and these models are tested on validation data. Three models are tried in this analysis. 1) K Nearest Neighbour 2) Gaussian Naive Bayes 3) Logistic Regression

In [69]:
# create a dataframe to store details of each algorithms tried  
data  = {'modelName':["KNN","NB","LR"],
         'modelObject':[KNeighborsClassifier(n_neighbors=1, n_jobs=-1),GaussianNB(),LogisticRegression(solver = "lbfgs")],
         'accuracy':[None,None,None]}   
modelDetails =pd.DataFrame(data,columns=['modelName','modelObject','accuracy'])
index = 0
# train 3 models and save its accuracy
for model in modelDetails['modelObject']:
    model_fit = model.fit(X_train, Y_train)     
    acc = model_fit.score(X_dev, Y_dev)
    modelDetails.iloc[index,2] = acc       
    index= index+1
print(modelDetails[['modelName', 'accuracy']]) 
c:\users\anjukutty\appdata\local\programs\python\python37\lib\site-packages\sklearn\linear_model\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
  modelName  accuracy
0       KNN  0.537313
1        NB  0.547264
2        LR  0.577114

Three classification algorithms have almost comparable perfomance. However, logistic regression model has highest accuracy (57%) among the three.

Next section focus on feature reduction of images. When it comes to images , number of pixels are number of features. Since the images have 250x250 resolution, the number of features are very high to process. Hence, compressing image is very essential to get an optimal perfomance. This analysis use PCA for dimension reduction. As it is impossible to randomly guess to what size the images dimension needs to be reduced, a check is done to find the number of dimensions that keeps 90% of the variance of the original images.

In [106]:
pca_dims = PCA()
pca_dims.fit(X_train)
cumsum = np.cumsum(pca_dims.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.90) + 1
d
Out[106]:
190

190!! Thats a big reduction in terms of dimension but we are not sue whether the classification model trained on this dimensionally reduced data will perform poor or better. To know that, the three models that is generated before are re generated but this time they are trained on dimensionally reduced data. Below lines of code train PCA on original training image set and transform the data.

In [107]:
pca = PCA(n_components=d)
X_reduced = pca.fit_transform(X_train)
X_recovered = pca.inverse_transform(X_reduced)
print("reduced shape: " + str(X_reduced.shape))
print("recovered shape: " + str(X_recovered.shape))
reduced shape: (801, 190)
recovered shape: (801, 187500)

From 187500 features the training features are reduced to just 190. Using this training set, 3 models are built and its accuracy is displayed on validation sample.

In [108]:
data  = {'modelName':["KNN","NB","LR"],
         'modelObject':[KNeighborsClassifier(n_neighbors=1, n_jobs=-1),GaussianNB(),LogisticRegression(solver = "lbfgs")],
         'accuracy':[None,None,None]}   
modelDetails =pd.DataFrame(data,columns=['modelName','modelObject','accuracy'])
index = 0
# train 3 models and save its accuracy
for model in modelDetails['modelObject']:
    model_fit = model.fit(X_reduced, Y_train)    
    # transform the validation set
    X_dev_reduced = pca.fit_transform(X_dev)
    acc = model_fit.score(X_dev_reduced, Y_dev)
    modelDetails.iloc[index,2] = acc       
    index= index+1
print(modelDetails[['modelName', 'accuracy']]) 
  modelName  accuracy
0       KNN  0.502488
1        NB  0.512438
2        LR  0.437811

Perfomance accuracy of the three models on after feature reduction is shown above. The accuracy is slighlty dropped in this case. But the training time for models is significantly minimised. Logistic regression model shows highest reduction in accuracy after feature reduction. But, Naive Bayes and KNN algorithms almost remained same.

In [112]:
# Train a LR model
model_LR = LogisticRegression(solver = "lbfgs")
model_LR.fit(X_train, Y_train)
Out[112]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

Prepare test data .

In [87]:
sourceDir = "./data/test/"
X_test_pre = list()
 # iterate over test image and create list of images in correct shape
for fileName in listdir(sourceDir):
    image_test = cv2.imread(sourceDir+fileName)
    image_test = cv2.resize(image_test, (250, 250))    
    X_test_pre.append(image_test)
X_test = np.stack(X_test_pre, axis=0)
X_test = np.reshape(X_test, (X_test.shape[0], -1))

Logistic classifier model performs the best among the three on original image set. Hence, using this model, the images in test set are classified as dog/ cat below.

In [113]:
# using the best model, classify dog and cat images.
dists = model_LR.predict(X_test)
print(dists)
unique, counts = np.unique(dists, return_counts=True)
dict(zip(unique, counts))
[0 0 0 0 1 0 1 1 1 0 1 0 1 1 0 0 1 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 1 0 0
 1 1 0 1 0 1 1 0 0 0 1 1 0 0 0 1 1 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 1 1 0 0
 0 0 0 1 1 1 1 1 1 0 1 1 0 0 1 1 0 1 0 0 1 1 1 0 0 0]
Out[113]:
{0: 54, 1: 46}

Among 100 images in test data, 54 are predicted to be cat and 46 are predicted as dogs. Below image display some of the images in test set and it predicted label at the top of image . Some of the predictions are correct and some are wrong but the logistic regression model is better than random guess.

In [94]:
_,ax = plt.subplots(5,6, figsize=(30,30))
for i in range(5):
    for j in range(6):
        ax[i,j].imshow(cv2.cvtColor(X_test_pre[10+(i*2)+j], cv2.COLOR_BGR2RGB))
        ax[i,j].axis('off')    
        if(dists[10+(i*2)+j] == 0):            
            ax[i,j].set_title("cat")
        else:
            ax[i,j].set_title("dog")
plt.show()

PART C) hyperparameter optimization on the final model.

Hyper parameter optimisation is tried on best logistoc regression model using grid search.

Two parameters 'C' and 'penality are tuned.

In [114]:
#  hyperparameter optimization on the final model
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
                  "penalty":["l1","l2"]}
model_tuned_LR = LogisticRegression()
grid_search = GridSearchCV(
model_tuned_LR, param_grid, cv=10, scoring='accuracy')
grid_search.fit(X_reduced, Y_train)    
bestLR = grid_search.best_estimator_ # choose the best estimator 
print("grid search best score :",grid_search.best_score_)
print("LR best params",grid_search.best_params_)
#train model on best estimator
bestLR.fit(X_reduced, Y_train)
acc = bestLR.score(X_dev_reduced,Y_dev)
print(acc)
grid search best score : 0.5555555555555556
LR best params {'C': 0.01, 'penalty': 'l1'}
0.5074626865671642

The tuned parameters for logistic regression are shown above. After hyper parameter optimisation, the accuracy of the model improved from 43% to 50% on validation dataset.

In [ ]: